This data set comes from the featured Kaggle datasets, containing three tables that relate to commercial airline flights. The flight delay and cancellation data was collected and published by the DOT's Bureau of Transportation Statistics.
There are three tables:
airports : contains IATA_CODE: Location Identifier String
AIRPORT: Airport's Name String
CITY: City Name of the Airport String
STATE: State Name of the Airport String
COUNTRY: Country Name of the Airport String
LATITUDE: Latitude of the Airport Numeric
LONGITUDE: Longitude of the Airport Numeric
In [3]:
import pandas as pd
import numpy as np
# Load the three DOT Bureau of Transportation Statistics tables:
# flights (a pre-sampled "small" raw extract), airline codes, and airport metadata.
flights = pd.read_csv('flights/flights_sm_raw.csv')
airlines = pd.read_csv('flights/airlines.csv')
airports = pd.read_csv('flights/airports.csv')
In [4]:
# Normalize the column labels of every table to lowercase so the rest of
# the notebook can use one consistent naming convention.
for table in (flights, airlines, airports):
    table.columns = [col.lower() for col in table.columns]
In [5]:
# Preview the first five flight records.
flights.head()
Out[5]:
In [6]:
# Preview the airline code table.
airlines.head()
Out[6]:
In [7]:
# Preview the airport metadata table.
airports.head()
Out[7]:
In [8]:
# Inspect the timing-related slice of columns (positions 9-22) for the first five flights.
flights.iloc[0:5, 9:23]
Out[8]:
We have these features:
The count column gives an idea of missing values for the following features:
In [9]:
# Summary statistics for the timing columns; the count row exposes how many values are missing.
flights.iloc[:, 9:23].describe().transpose()
Out[9]:
I'm selecting the critical features among the ones above and deleting all rows which have null values for them:
In [11]:
# Features that must be present for the timing analysis below; a flight
# missing any of them cannot be repaired, so those rows are removed outright.
critical = ['scheduled_departure', 'departure_time', 'wheels_off', 'scheduled_time',
            'wheels_on', 'scheduled_arrival', 'arrival_time']
# Drop every row with a null in any critical column in a single pass,
# instead of one index-based drop per column.
flights.dropna(subset=critical, inplace=True)
len(flights)
Out[11]:
The solution goes like this:
In [ ]:
# Split by cancellation status; .copy() keeps later edits from touching `flights`.
cancelled = flights.loc[flights['cancelled'] == 1].copy()
not_cancelled = flights.loc[flights['cancelled'] == 0].copy()
In [ ]:
from itertools import combinations

# BUG FIX: the original assigned all five lists to `subset1`, so the loops
# below raised NameError on subset2..subset5. Each group holds columns whose
# values are mutually derivable from each other.
subset1 = ['departure_time', 'taxi_out', 'wheels_off']
subset2 = ['scheduled_departure', 'departure_delay']
subset3 = ['wheels_off', 'air_time', 'wheels_on']
subset4 = ['taxi_in', 'wheels_on', 'arrival_time']
subset5 = ['scheduled_arrival', 'arrival_delay']

# For 3-column groups a missing value is recoverable when the other two are
# present, so drop rows where both columns of some pair are missing together.
for subset in [subset1, subset3, subset4]:
    for c in combinations(subset, 2):
        not_cancelled.dropna(how='all', subset=list(c), inplace=True)
# For 2-column groups the row is unusable only when both values are missing.
for subset in [subset2, subset5]:
    not_cancelled.dropna(how='all', subset=subset, inplace=True)
# sample(frac=1) shuffles the recombined rows into a random order.
data = pd.concat([cancelled, not_cancelled]).sample(frac=1).reset_index(drop=True)
In [12]:
times = ['scheduled_departure', 'departure_time', 'wheels_off', 'wheels_on', 'scheduled_arrival', 'arrival_time']

def hhmm_to_str(x):
    """Convert a numeric HHMM value (e.g. 1435.0) to an 'HH:MM:00' string.

    The original inline lambda built the padded string twice per element;
    this helper computes it once and zero-pads with format().
    """
    s = format(int(x), '04d')  # zero-pad so e.g. 5.0 -> '0005'
    return s[-4:-2] + ':' + s[-2:] + ':00'

# Reformat every clock-time column from HHMM floats to 'HH:MM:SS' strings.
for col in times:
    flights[col] = flights[col].apply(hhmm_to_str)
flights[times].head()
Out[12]:
The cause is that some columns have 24:00:00 as a value; I'm guessing this means midnight of the following day, so I'm building a full datetime (including the date) and adding one day in these cases.
In [19]:
# Sanity check: a valid clock string has hour in 0..23, so list every value
# past '23:59:59' in each time column (string comparison works because the
# format is fixed-width HH:MM:SS).
for time_col in times:
    bad_values = flights.loc[flights[time_col] > '23:59:59', time_col]
    print(time_col)
    print(bad_values.unique())
In [20]:
def convert_dt(year, month, day, time):
    """Build a Timestamp from date parts and an 'HH:MM:SS' string.

    A time of '24:00:00' is interpreted as midnight of the following day,
    so the date is advanced by one day in that case.
    """
    date_part = '{}-{}-{} '.format(year, month, day)
    if time == '24:00:00':
        return pd.to_datetime(date_part + '00:00:00') + pd.Timedelta(1, unit='D')
    return pd.to_datetime(date_part + time)
# Combine year/month/day with each clock-time column into full timestamps.
for time_col in times:
    flights[time_col] = flights.apply(
        lambda row: convert_dt(row['year'], row['month'], row['day'], row[time_col]),
        axis=1)
In [ ]:
# Solution snippet for testing the conversion function.
# BUG FIX: `get_loc` is a method and must be called with parentheses — the
# original subscripted it (`get_loc[...]`), which raises TypeError.
# NOTE(review): `hour_to_datetime` and `data` come from the course solution
# and are not defined anywhere in this notebook.
c = flights.columns.get_loc('departure_time') + 1
_ = [hour_to_datetime(r, c) for r in data.itertuples()]
In [21]:
# Confirm the time columns now hold full datetime values.
flights[times].head()
Out[21]:
These are the features I can calculate and the count column indicates which ones need to be filled:
In [22]:
# Derived timing features; the count row below shows which ones contain
# nulls that can be reconstructed from the timestamp columns.
calculated = ['departure_delay', 'taxi_out', 'elapsed_time', 'taxi_in', 'arrival_delay', 'air_time']
flights[calculated].describe().transpose()
# How each feature is derived:
# departure_delay = departure_time - scheduled_departure
# taxi_out        = wheels_off - departure_time
# elapsed_time    = air_time + taxi_in + taxi_out
# taxi_in         = arrival_time - wheels_on
# arrival_delay   = arrival_time - scheduled_arrival
# air_time        = wheels_on - wheels_off
Out[22]:
In [54]:
# Fill missing air_time from the wheels-off/wheels-on timestamps.
# .isna() replaces the obscure `x != x` NaN test.
to_fill = flights[flights['air_time'].isna()]
# Difference in minutes; total_seconds()/60 avoids astype('timedelta64[m]'),
# which is deprecated/removed in modern pandas. The timestamps were built
# with ':00' seconds, so the result is always a whole number of minutes.
diff = (to_fill['wheels_on'] - to_fill['wheels_off']).dt.total_seconds() / 60
# A negative difference means the flight landed past midnight: add one day
# of minutes (vectorized instead of an element-wise apply).
diff = diff.where(diff >= 0, diff + 24 * 60)
flights.loc[to_fill.index, 'air_time'] = diff
In [55]:
# Fill missing arrival_delay from actual vs. scheduled arrival.
to_fill = flights[flights['arrival_delay'].isna()]
# Minutes between the timestamps (see note on astype('timedelta64[m]') above
# cells: total_seconds()/60 is the modern, non-deprecated equivalent).
diff = (to_fill['arrival_time'] - to_fill['scheduled_arrival']).dt.total_seconds() / 60
# Negative values are treated as overnight wrap-around and shifted a day.
# NOTE(review): a genuinely early arrival also yields a negative value and
# gets +24h here — same behavior as the original; verify this is intended.
diff = diff.where(diff >= 0, diff + 24 * 60)
flights.loc[to_fill.index, 'arrival_delay'] = diff
In [56]:
# Fill missing elapsed_time from its components (air_time was repaired in
# the cell above, so the sum is defined for these rows).
to_fill = flights[flights['elapsed_time'].isna()]
# Compute only on the rows that need filling; the original summed the whole
# frame and relied on index alignment — same result, less work.
diff = to_fill['air_time'] + to_fill['taxi_in'] + to_fill['taxi_out']
flights.loc[to_fill.index, 'elapsed_time'] = diff
In [57]:
# Re-check the derived features; counts for the filled columns should have increased.
flights[calculated].describe().transpose()
Out[57]:
I'm going to match flights with airports and delete the rows that don't match. I reindex the dataset so that the rows in the merged dataset line up with the rows in the original one (some rows were dropped earlier, so this is necessary):
In [59]:
# Drop flights whose origin airport does not appear in the airports table.
# isin() replaces the fragile reset_index -> left-merge -> set_index round
# trip, which only worked because the merge preserved the row index.
# NOTE(review): assumes iata_code values are non-null and unique, which the
# original merge-based version also relied on.
unmatched_mask = ~flights['origin_airport'].isin(airports['iata_code'])
flights.drop(flights[unmatched_mask].index, inplace=True)
All unmatched rows have already been deleted by the origin-airport step, so the following destination check should find nothing to remove:
In [60]:
# Same matching check for destination airports; per the note above, the
# unmatched rows were already removed, so this should display an empty frame.
unmatched = pd.merge(flights.reset_index()[['index', 'destination_airport']], airports, left_on='destination_airport', right_on='iata_code', how='left').set_index('index')
unmatched[unmatched['iata_code'].isnull()]
Out[60]:
In [61]:
# Remaining number of flights after all the matching-based deletions.
len(flights)
Out[61]:
This should be quite similar to the function I wrote above, only checking if the time difference between what I'm looking at and the first time in the row is negative and adding one day in this case.
In [ ]:
times2 = ['departure_time', 'wheels_off', 'wheels_on', 'scheduled_arrival', 'arrival_time']

def correct_overnight(time, starttime):
    """Shift `time` one day forward when it precedes `starttime`.

    Times earlier than the scheduled departure must belong to the next
    calendar day (overnight flights). BUG FIX: the original wrote
    `(time - starttime) < 0`, which compares a Timedelta against the integer
    0 and raises TypeError in modern pandas; comparing the two Timestamps
    directly is equivalent and always valid.
    """
    if time < starttime:
        return time + pd.Timedelta(1, unit='D')
    return time
# Push every post-departure time that precedes the scheduled departure into
# the next day.
for time_col in times2:
    flights[time_col] = flights.apply(
        lambda row: correct_overnight(row[time_col], row['scheduled_departure']),
        axis=1)
In [ ]:
# from solution: x.replace(day=day)
colors : This file contains information on LEGO colors, including a unique ID for each color, its name, and approximate RGB value, and whether it's transparent
inventories : This table contains information on inventories, including a unique ID, its version and the set number.
inventory_parts : This table contains information on part inventories, including a unique ID number, the part number, the color of the part, how many are included and whether it's a spare.
inventory_sets : This file contains information on what inventory is included in which sets, including the inventory ID, the set number and the quantity of that inventory that are included.
part_categories : This dataset includes information on the part category (what type of part it is) and a unique ID for that part category.
parts : This dataset includes information on lego parts, including a unique ID number, the name of the part, and what part category it's from.
sets : This file contains information on LEGO sets, including a unique ID number, the name of the set, the year it was released, its theme and how many parts it includes.
themes : This file includes information on lego themes. Each theme is given a unique ID number, a name, and (if it's part of a bigger theme) which theme it's part of.
In [62]:
# Read every LEGO table into a dict keyed by table name.
tables = ['colors', 'inventories', 'inventory_parts', 'inventory_sets', 'part_categories', 'parts', 'sets', 'themes']
legos = {table: pd.read_csv('legos/' + table + '.csv') for table in tables}
In [63]:
# Peek at the color table.
legos['colors'].head()
Out[63]:
In [64]:
# Merge colors with inventory_parts and drop the join/bookkeeping columns.
color_cat = pd.merge(legos['colors'], legos['inventory_parts'], left_on='id', right_on='color_id', how='left')
color_cat.drop(['id', 'inventory_id', 'color_id', 'is_spare'], axis=1, inplace=True)
# Merge with parts and drop columns (name_y is the suffixed part name).
color_cat = pd.merge(color_cat, legos['parts'], left_on='part_num', right_on='part_num', how='outer')
color_cat.drop(['part_num', 'name_y'], axis=1, inplace=True)
# Merge with part_categories and drop the key columns.
color_cat = pd.merge(color_cat, legos['part_categories'], left_on='part_cat_id', right_on='id', how='outer')
color_cat.drop(['part_cat_id', 'id'], axis=1, inplace=True)
# Group and sum per (color, category), then set readable column names and
# sort by the summed count. NOTE(review): the positional rename assumes the
# grouped frame has exactly these five columns in this order.
color_cat = color_cat.groupby(['name_x', 'rgb', 'is_trans', 'name']).sum().reset_index()
names = ['color_name', 'rgb', 'is_trans', 'category_name', 'number']
color_cat.columns = names
color_cat.sort_values(by='number', ascending=False).head(10)
Out[64]:
In [71]:
# Merging sets and themes and dropping columns
set_themes = pd.merge(legos['sets'], legos['themes'], left_on='theme_id', right_on='id', how='left')\
.merge(legos['inventories'], on='set_num')\
.merge(legos['inventory_parts'], left_on='id_y', right_on='inventory_id')
set_themes.drop(['set_num', 'theme_id', 'num_parts', 'id_x', 'id_y', 'is_spare'], axis=1, inplace=True)
# Reset names
set_themes.rename(columns={'name_x': 'set_name',
'name_y': 'theme_name'}, inplace=True)
set_themes.head()
Out[71]:
In [69]:
# Creating the copy
parts_copy = legos['parts'].copy()
# Merging with inventory_parts
parts_copy = pd.merge(parts_copy, legos['inventory_parts'], left_on='part_num', right_on='part_num', how='left')
# Merging with inventories
parts_copy = pd.merge(parts_copy, legos['inventories'], left_on='inventory_id', right_on='id', how='left')
# Merging with sets
parts_copy = pd.merge(parts_copy, legos['sets'], left_on='set_num', right_on='set_num', how='left')
# Dropping and renaming columns
parts_copy.drop(['inventory_id', 'id'], axis=1, inplace=True)
names = ['part_num', 'part_name', 'part_cat_id', 'color_id', 'quantity', 'is_spare', 'version', 'set_num', 'set_name', 'year', 'theme_id', 'num_parts']
parts_copy.columns = names
parts_copy.head()
Out[69]:
In [73]:
# NOTE(review): out-of-order execution — set_themes_copy is defined in the
# cell below (In [75]) but displayed here (In [73]); this cell fails under
# Restart Kernel -> Run All. Move it after the defining cell.
set_themes_copy.head()
Out[73]:
In [75]:
# Work on a copy so set_themes is preserved.
set_themes_copy = set_themes.copy()
# Merge in color information (inner join drops rows with unknown colors).
set_themes_copy = pd.merge(set_themes_copy, legos['colors'], left_on='color_id', right_on='id', how='inner')
# Encode transparency as 0/1 so it can be summed per group.
set_themes_copy['is_trans'] = set_themes_copy['is_trans'].apply(lambda x: 1 if x == 't' else 0)
# Group and sum. BUG FIX: selecting groupby columns with a bare tuple
# (['quantity', 'is_trans'] without the outer list) was deprecated and is
# removed in pandas 2.0 — a list selection is the supported form.
set_themes_copy = set_themes_copy.groupby(['set_name', 'theme_name'])[['quantity', 'is_trans']].sum()
# Reset the index and give the summed columns descriptive names.
set_themes_copy.reset_index(inplace=True)
names = ['set_name', 'theme_name', 'quantity', 'transparent_quantity']
set_themes_copy.columns = names
set_themes_copy.head()
Out[75]: